Lecture 6: A Grand Tour of Machine Learning

The Big Picture

Complementary Fields

Statistics in All of Its Forms

Machine Learning is a Multivariate Statistical Analysis

Categories of Machine Learning

A Textbook

Exploring Machine Learning with the AirBnB Data

library(data.table)
dat <- fread(input = "AirBnB analysisData.csv", sep = ",", 
    fill = TRUE)
dim(dat)
[1] 29142    96

Some Data Cleaning Steps

num.bedrooms.name <- "bedrooms"
num.bathrooms.name <- "bathrooms"
lat.name <- "latitude"
long.name <- "longitude"
cozy.name <- "cozy"
description.name <- "description"
the.variables <- c(num.bedrooms.name, num.bathrooms.name, lat.name, long.name)

dat[, eval(cozy.name) := 0]
dat[grep(pattern = cozy.name, x = tolower(get(description.name))), eval(cozy.name) := 1]
dat[, c(eval(num.bedrooms.name), eval(num.bathrooms.name), eval(lat.name), eval(long.name)) := lapply(X = .SD, FUN = "as.numeric"), .SDcols = the.variables]

Notice how multiple columns can be assigned in a single step with this use of lapply and .SDcols.
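The same idiom can be illustrated on a small toy table (hypothetical columns a and b, stored as character and converted in one step):

```r
# A toy illustration of multi-column assignment with lapply and .SDcols:
library(data.table)
toy <- data.table(a = c("1", "2"), b = c("3.5", "4.5"))
toy[, c("a", "b") := lapply(X = .SD, FUN = "as.numeric"), .SDcols = c("a", "b")]
toy[, sapply(X = .SD, FUN = "class")]  # both columns are now "numeric"
```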

Unsupervised Learning

Distance Functions

Different Methods of Clustering

K-Means Clustering

Implementing K-Means

clust.dat <- dat[, .SD, .SDcols = the.variables]
clust.kmeans <- kmeans(x = clust.dat, centers = 5)
plot(x = clust.dat[, get(long.name)], y = clust.dat[, get(lat.name)], 
    col = clust.kmeans$cluster, cex = 0.3, xlab = "Longitude", 
    ylab = "Latitude", las = 1)

Hierarchical Clustering

Implementing Hierarchical Clustering

tic <- Sys.time()
clust.hier <- hclust(d = dist(x = as.matrix(clust.dat[1:100])))
toc <- Sys.time()
print(toc - tic)
Time difference of 0.003302813 secs

Plotting the Hierarchy

plot(clust.hier)
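The dendrogram can also be cut into a fixed number of groups with cutree. A brief sketch, assuming the clust.hier object fitted in the chunk above:

```r
# Cut the fitted hierarchy into 5 groups and count the members of each.
groups <- cutree(tree = clust.hier, k = 5)
table(groups)
```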

Some Problems with Clustering

Other Forms of Unsupervised Learning

Supervised Learning

Types of Outcomes

Describing the Types

This list is not necessarily comprehensive but provides a good overview.

A Range of Methods For Each Setting

Differing Objectives

Understanding the Goals

Levels of Information

Telling a Story Versus Accuracy

And What are We Estimating/Predicting?

Judging Accuracy

my.rmse <- function(predicted, actual, na.rm = TRUE) {
    return(sqrt(mean((predicted - actual)^2, na.rm = na.rm)))
}
percentage.correctly.classified <- function(predicted, actual, 
    na.rm = TRUE) {
    return(mean(predicted == actual, na.rm = na.rm))
}
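A quick check of the two accuracy functions on made-up values:

```r
my.rmse(predicted = c(1, 2, 3), actual = c(1, 2, 5))
# sqrt(mean(c(0, 0, 4))) = approximately 1.155
percentage.correctly.classified(predicted = c(1, 0, 1, 1), actual = c(1, 0, 0, 1))
# 3 of 4 match: 0.75
```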

Training and Testing Sets

training.row.name <- "training_row"
dat[, eval(training.row.name) := sample(x = c(TRUE, FALSE), size = .N, replace = TRUE, prob = c(0.8, 0.2))]

Overfitting

Cataloging the Methods of Supervised Learning

Categories of Methods

Classical Models

A number of techniques are considered classical statistical models.

We will discuss each of these classical methods in greater detail.

Linear Regression

Linear Regression: Setting

Creating Formulas

From our previous lecture:

create.formula <- function(outcome.name, input.names, input.patterns = NA, 
    all.data.names = NA, return.as = "character") {
    variable.names.from.patterns <- c()
    if (!is.na(input.patterns[1]) & !is.na(all.data.names[1])) {
        pattern <- paste(input.patterns, collapse = "|")
        variable.names.from.patterns <- all.data.names[grep(pattern = pattern, 
            x = all.data.names)]
    }
    all.input.names <- unique(c(input.names, variable.names.from.patterns))
    all.input.names <- all.input.names[all.input.names != 
        outcome.name]
    if (!is.na(all.data.names[1])) {
        all.input.names <- all.input.names[all.input.names %in% 
            all.data.names]
    }
    input.names.delineated <- sprintf("`%s`", all.input.names)
    the.formula <- sprintf("`%s` ~ %s", outcome.name, paste(input.names.delineated, 
        collapse = " + "))
    if (return.as == "formula") {
        return(as.formula(the.formula))
    }
    if (return.as != "formula") {
        return(the.formula)
    }
}
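For instance, with hypothetical variable names, create.formula produces a backtick-delineated formula string:

```r
create.formula(outcome.name = "price", input.names = c("bedrooms", "bathrooms"))
# [1] "`price` ~ `bedrooms` + `bathrooms`"
```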

Reduce Formula Function

reduce.formula <- function(dat, the.initial.formula, max.categories = NA) {
    require(data.table)
    dat <- setDT(dat)
    
    the.sides <- strsplit(x = the.initial.formula, split = "~")[[1]]
    lhs <- trimws(x = the.sides[1], which = "both")
    lhs.original <- gsub(pattern = "`", replacement = "", 
        x = lhs)
    if (!(lhs.original %in% names(dat))) {
        return("Error:  Outcome variable is not in names(dat).")
    }
    
    the.pieces.untrimmed <- strsplit(x = the.sides[2], split = "+", 
        fixed = TRUE)[[1]]
    the.pieces.untrimmed.2 <- gsub(pattern = "`", replacement = "", 
        x = the.pieces.untrimmed, fixed = TRUE)
    the.pieces.in.names <- trimws(x = the.pieces.untrimmed.2, 
        which = "both")
    
    the.pieces <- the.pieces.in.names[the.pieces.in.names %in% 
        names(dat)]
    num.variables <- length(the.pieces)
    include.pieces <- logical(num.variables)
    
    for (i in 1:num.variables) {
        unique.values <- dat[, unique(get(the.pieces[i]))]
        num.unique.values <- length(unique.values)
        if (num.unique.values >= 2) {
            include.pieces[i] <- TRUE
        }
        if (!is.na(max.categories)) {
            if (dat[, is.character(get(the.pieces[i])) | 
                is.factor(get(the.pieces[i]))] == TRUE) {
                if (num.unique.values > max.categories) {
                  include.pieces[i] <- FALSE
                }
            }
        }
    }
    pieces.rhs <- sprintf("`%s`", the.pieces[include.pieces == 
        TRUE])
    rhs <- paste(pieces.rhs, collapse = " + ")
    the.formula <- sprintf("%s ~ %s", lhs, rhs)
    return(the.formula)
}
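Continuing with the function above, reduce.formula drops right-hand-side terms that are absent from the data or constant (hypothetical toy table):

```r
library(data.table)
toy <- data.table(y = c(1, 2, 3), a = c(4, 5, 6), b = c(7, 7, 7))
# b is constant and missing is not in the data, so both are dropped:
reduce.formula(dat = toy, the.initial.formula = "`y` ~ `a` + `b` + `missing`")
# [1] "`y` ~ `a`"
```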

Regression Wrapper Functions

We will build some useful tools for implementing and summarizing linear and logistic regressions:

linear.regression.summary <- function(lm.mod, digits = 3, alpha = 0.05) {
  lm.coefs <- as.data.table(summary(lm.mod)$coefficients,
  keep.rownames = TRUE)
  setnames(x = lm.coefs, old = "rn", new = "Variable")
  z <- qnorm(p = 1 - alpha/2, mean = 0, sd = 1)
  lm.coefs[, Coef.Lower.95 := Estimate - z * `Std. Error`]
  lm.coefs[, Coef.Upper.95 := Estimate + z * `Std. Error`]
  return(lm.coefs)
}
logistic.regression.summary <- function(glm.mod, digits = 3, alpha = 0.05) {
  glm.coefs <- as.data.table(summary(glm.mod)$coefficients,
  keep.rownames = TRUE)
  setnames(x = glm.coefs, old = "rn", new = "Variable")
  z <- qnorm(p = 1 - alpha/2, mean = 0, sd = 1)
  glm.coefs[, Odds.Ratio := exp(Estimate)]
  glm.coefs[, OR.Lower.95 := exp(Estimate - z * `Std. Error`)]
  glm.coefs[, OR.Upper.95 := exp(Estimate + z * `Std. Error`)]
  return(glm.coefs[])
}

Fitting Functions

round.numerics <- function(x, digits) {
    if (is.numeric(x)) {
        x <- round(x = x, digits = digits)
    }
    return(x)
}
fit.model <- function(dat, the.initial.formula, model.type, 
    digits = 3) {
    the.formula <- reduce.formula(dat = dat, the.initial.formula = the.initial.formula)
    if (model.type == "logistic") {
        mod <- glm(formula = the.formula, family = "binomial", 
            data = dat)
        mod.summary <- logistic.regression.summary(glm.mod = mod, 
            digits = digits)
    }
    if (model.type == "linear") {
        mod <- lm(formula = the.formula, data = dat)
        mod.summary <- linear.regression.summary(lm.mod = mod, 
            digits = digits)
    }
    mod.summary.rounded <- mod.summary[, lapply(X = .SD, 
        FUN = "round.numerics", digits = digits)]
    return(list(summary = mod.summary.rounded, obj = mod))
}

Linear Price Model

price.name <- "price"
old.borough.name <- "neighbourhood_group_cleansed"
borough.name <- "Borough"
input.names <- c(num.bedrooms.name, num.bathrooms.name, 
    cozy.name, borough.name)
setnames(x = dat, old = old.borough.name, new = borough.name)
formula.price <- create.formula(outcome.name = price.name, 
    input.names = input.names)
mod.lm <- fit.model(dat = dat[get(training.row.name) == 
    TRUE, ], the.initial.formula = formula.price, model.type = "linear")

Price Model Results

library(DT)
datatable(data = mod.lm$summary, rownames = FALSE)

Rental prices are positively associated with the number of bedrooms and bathrooms and with locations in Brooklyn, Manhattan, and Queens (relative to the Bronx). A cozy description is associated with lower prices. All of these effects are statistically significant. Staten Island's estimate is lower, but the effect is not statistically significant, so its prices may in fact be comparable to the Bronx's.

The Accuracy of Linear Regression’s Predictions

pred.lm <- predict.lm(object = mod.lm$obj, newdata = dat[get(training.row.name) == 
    FALSE])
rmse.lm <- my.rmse(predicted = pred.lm, actual = dat[get(training.row.name) == 
    FALSE, get(price.name)])
print(round(x = rmse.lm, digits = 1))
[1] 87

This is an improvement on the standard deviation of the prices: 106.

In other words, incorporating the information in the predictor variables leads to better predictions!

Logistic Regression

Logistic Regression: Setting

Logistic Regression: High Rental Prices

high.price.name <- "High Price"
threshold.high.price <- 200
dat[, eval(high.price.name) := (get(price.name) >= threshold.high.price)]
formula.high.price <- create.formula(outcome.name = high.price.name, 
    input.names = input.names)
mod.logistic <- fit.model(dat = dat[get(training.row.name) == 
    TRUE, ], the.initial.formula = formula.high.price, model.type = "logistic")

Model Results

datatable(data = mod.logistic$summary, rownames = FALSE)

Additional bedrooms, bathrooms, and a location in Manhattan, Brooklyn, or Queens (relative to the Bronx) are all associated with substantial increases in the odds of a high-priced rental. A cozy description was associated with a decrease in those odds. All of these effects are statistically significant, while Staten Island shows no significant difference from the Bronx in terms of these odds.

Logistic Regression’s Predictions

pred.logistic <- predict.glm(object = mod.logistic$obj, 
    newdata = dat[get(training.row.name) == FALSE], type = "response")
rmse.logistic <- my.rmse(predicted = pred.logistic, actual = dat[get(training.row.name) == 
    FALSE, get(high.price.name)])
print(round(x = rmse.logistic, digits = 2))
[1] 0.32

Logistic Regression’s Classifications

classifications.logistic <- 1 * (pred.logistic >= 0.5)
correct.pct.logistic <- percentage.correctly.classified(predicted = classifications.logistic, 
    actual = dat[get(training.row.name) == FALSE, get(high.price.name)])
print(round(x = correct.pct.logistic, digits = 3))
[1] 0.876

The logistic regression correctly classified the testing set’s outcomes a reasonably high percentage of the time.
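Beyond overall accuracy, a cross-tabulation of classifications against outcomes shows which kinds of errors occur. A brief sketch, assuming the objects from the chunks above:

```r
# Confusion matrix for the testing set:
actual.high <- dat[get(training.row.name) == FALSE, get(high.price.name)]
table(predicted = classifications.logistic, actual = actual.high)
```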

Cox Proportional Hazards Regression

The Setting of Cox Regression

The conditional hazard of the outcome at time \(t\) given inputs \(X\) is modeled as the baseline hazard scaled by an exponential term:

\[h(t \mid X) = h_0(t) \cdot \exp\left(\displaystyle\sum_{k = 1}^{m}\beta_k X_k\right)\]

The Hazard Ratio
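For a one-unit increase in a single input \(X_k\), with the other inputs held fixed, the baseline hazard cancels, and the hazard ratio reduces to a constant:

\[HR_k = \frac{h_0(t)\,\exp(\beta_1 X_1 + \dots + \beta_k (X_k + 1) + \dots + \beta_m X_m)}{h_0(t)\,\exp(\beta_1 X_1 + \dots + \beta_k X_k + \dots + \beta_m X_m)} = \exp(\beta_k).\]

Because this ratio does not depend on \(t\), the hazards are proportional over time, which gives the model its name.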

A Detour: The HIV Data from Lecture 3

time.name <- "time"
outcome.name <- "censor"
tx.name <- "tx"
sex.name <- "sex"
age.name <- "age"
ivdrug.name <- "ivdrug"
cd4.name <- "cd4"
predictor.names <- c(tx.name, sex.name, age.name, ivdrug.name, 
    cd4.name)
hiv.dat <- fread(input = "AIDS data.csv")

Plotting Survival Curves

We’ll display the comparative survival curves free of AIDS (where 1 is 100%) for the two treatment groups:

library(survival)
surv.obj <- Surv(time = hiv.dat[, get(time.name)], event = hiv.dat[, 
    get(outcome.name)])
plot(survfit(formula = surv.obj ~ hiv.dat[, get(tx.name)]), 
    las = 1, lty = c(1, 2), col = c(1, 2), ylim = c(0.8, 
        1), xlab = "Days After Baseline", ylab = "Survival Free of AIDS", 
    main = "3-Drug Cocktail RCT: Survival Curves")
legend(x = "bottomleft", legend = c("3-Drug Cocktail", "2-Drug Control"), 
    lty = c(2, 1), col = c(2, 1))

Cox Regression of AIDS-Free Survival

the.formula <- as.formula(sprintf("surv.obj ~ %s", paste(sprintf("hiv.dat$%s", 
    predictor.names), collapse = "+")))
coxmod <- coxph(formula = the.formula)
the.coefs <- as.data.table(summary(coxmod)$coefficients, 
    keep.rownames = TRUE)
setnames(x = the.coefs, old = names(the.coefs), new = c("Variable", 
    "Coefficient", "Hazard Ratio", "Std. Error", "Z", "P"))
alpha <- 0.05
zval <- qnorm(p = 1 - alpha/2, mean = 0, sd = 1)
the.coefs[, `:=`(HR.Lower.95, exp(Coefficient - zval * `Std. Error`))]
the.coefs[, `:=`(HR.Upper.95, exp(Coefficient + zval * `Std. Error`))]
the.coefs[, `:=`(Variable, gsub(pattern = "hiv.dat$", replacement = "", 
    x = Variable, fixed = TRUE))]

Interpreting the Coefficients

datatable(data = the.coefs[, lapply(X = .SD, FUN = "round.numerics", 
    digits = 3)], rownames = FALSE)

The 3-Drug Cocktail significantly reduces the hazard of AIDS or Death while adjusting for a variety of additional risk factors.

Regularized Regression: Motivation

Types of Regularization and Variable Selection Algorithms

Ridge Regression’s Setting

Lasso Regression’s Setting

Multiple Outcomes

Both Ridge and Lasso regression can be used in a variety of settings:

Understanding the Penalty

Selecting \(\lambda\)

A Tradeoff Between Bias and Variance

Understanding the Tradeoffs

K-Fold Cross Validation

Selecting \(\lambda\) with K-Fold Cross Validation
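In practice, \(\lambda\) is typically selected with glmnet's built-in K-fold routine, cv.glmnet. A brief sketch on simulated data (the input matrix, coefficients, and seed here are illustrative, not from the AirBnB analysis):

```r
library(glmnet)
set.seed(72)
x <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)           # simulated inputs
y <- as.numeric(x %*% c(2, -1, 0, 0, 0.5)) + rnorm(100)     # simulated outcome
cv.mod <- cv.glmnet(x = x, y = y, alpha = 0, nfolds = 10)   # alpha = 0: ridge
cv.mod$lambda.min  # lambda minimizing the cross-validated error
cv.mod$lambda.1se  # largest lambda within 1 standard error of that minimum
```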

Regularized Regression with glmnet

create.x.and.y <- function(the.formula, data) {
    require(data.table)
    setDT(data)
    x <- model.matrix(object = as.formula(the.formula), 
        data = data)
    y.name <- trimws(x = gsub(pattern = "`", replacement = "", 
        x = strsplit(x = the.formula, split = "~")[[1]][1], 
        fixed = TRUE))
    y <- data[as.numeric(rownames(x)), get(y.name)]
    return(list(x = x, y = y))
}
my.regularized.model <- function(the.formula, training.data, 
    testing.data, alpha = 0, family = "gaussian") {
    library(glmnet)
    x.y.train <- create.x.and.y(the.formula = the.formula, 
        data = training.data)
    mod <- glmnet(x = x.y.train$x, y = x.y.train$y, family = family, 
        alpha = alpha)
    the.coefs <- as.data.table(mod$beta[, ncol(mod$beta)], 
        keep.rownames = TRUE)
    setnames(x = the.coefs, old = names(the.coefs), new = c("Variable", 
        "Beta"))
    
    x.y.test <- create.x.and.y(the.formula = the.formula, 
        data = testing.data)
    pred <- predict(object = mod, newx = x.y.test$x, type = "response")
    the.rmse <- my.rmse(predicted = pred[, ncol(pred)], 
        actual = x.y.test$y)
    return(list(coefs = the.coefs, pred = pred, rmse = the.rmse, 
        lambda = mod$lambda[length(mod$lambda)]))
}

Ridge Regression on AirBnB

mod.ridge <- my.regularized.model(the.formula = formula.price, 
    training.data = dat[get(training.row.name) == TRUE, 
        ], testing.data = dat[get(training.row.name) == 
        FALSE, ], alpha = 0, family = "gaussian")
datatable(data = mod.ridge$coefs[, lapply(X = .SD, FUN = "round.numerics", 
    digits = 3)], rownames = FALSE)

Lasso Regression on AirBnB

mod.lasso <- my.regularized.model(the.formula = formula.price, 
    training.data = dat[get(training.row.name) == TRUE, 
        ], testing.data = dat[get(training.row.name) == 
        FALSE, ], alpha = 1, family = "gaussian")
datatable(data = mod.lasso$coefs[, lapply(X = .SD, FUN = "round.numerics", 
    digits = 3)], rownames = FALSE)

Tree-Based Models

Multiple Outcomes

Decision Trees: The Simple Version

Making Predictions or Classifications with a Tree

A Decision Tree Model of Price

library(rpart)
mod.rpart <- rpart(formula = formula.price, data = dat[get(training.row.name) == 
    TRUE, ])
pred.rpart <- predict(object = mod.rpart, newdata = dat[get(training.row.name) == 
    FALSE, ], type = "vector")
rmse.rpart <- my.rmse(predicted = pred.rpart, actual = dat[get(training.row.name) == 
    FALSE, get(price.name)])
print(rmse.rpart)
[1] 80.15542

Disadvantages of Decision Trees

A Big Advantage of Decision Trees

Decision Tree Model with Coordinates

rating.name <- "review_scores_rating"
formula.price.lat.long <- create.formula(outcome.name = price.name, 
    input.names = c(num.bedrooms.name, num.bathrooms.name, 
        cozy.name, lat.name, long.name, rating.name))
mod.rpart.lat.long <- rpart(formula = formula.price.lat.long, 
    data = dat[get(training.row.name) == TRUE, ])
pred.rpart.lat.long <- predict(object = mod.rpart.lat.long, 
    newdata = dat[get(training.row.name) == FALSE, ], type = "vector")
rmse.rpart.lat.long <- my.rmse(predicted = pred.rpart.lat.long, 
    actual = dat[get(training.row.name) == FALSE, get(price.name)])

res <- data.table(`Tree with Borough RMSE` = rmse.rpart, 
    `Tree with Coordinates RMSE` = rmse.rpart.lat.long)
datatable(data = res[, lapply(X = .SD, FUN = "round.numerics", 
    digits = 3)], rownames = FALSE)

Improving on a Simple Decision Tree

Many Trees

The Bootstrap

Bootstrapped Metrics
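The idea can be sketched on simulated predictions: resample the (predicted, actual) pairs with replacement and recompute the metric many times to estimate its variability. The data and seed below are illustrative:

```r
set.seed(72)
actual <- rnorm(n = 500, mean = 100, sd = 20)            # simulated outcomes
predicted <- actual + rnorm(n = 500, mean = 0, sd = 10)  # simulated predictions
B <- 1000
boot.rmse <- sapply(X = 1:B, FUN = function(b) {
    idx <- sample(x = 500, size = 500, replace = TRUE)   # bootstrap resample
    sqrt(mean((predicted[idx] - actual[idx])^2))
})
quantile(x = boot.rmse, probs = c(0.025, 0.975))  # bootstrap 95% interval
```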

Bagging

A Problem with Bagging

Random Forests

Fitting a Random Forest to AirBnB’s Prices

library(randomForest)
mod.rf <- randomForest(formula = as.formula(formula.price.lat.long), 
    data = dat[get(training.row.name) == TRUE, ])
pred.rf <- predict(object = mod.rf, newdata = dat[get(training.row.name) == 
    FALSE, ])
rmse.rf <- my.rmse(predicted = pred.rf, actual = dat[get(training.row.name) == 
    FALSE, get(price.name)])
print(rmse.rf)
[1] 71.27092

This is a substantial improvement over the linear regression and simple tree-based models.

Gradient Boosting: Algorithm

Gradient Boosting: Concepts

Fitting a Gradient Boosting Model to AirBnB’s Prices

library(xgboost)
dmat.train <- xgb.DMatrix(data = data.matrix(dat[get(training.row.name) == 
    TRUE, .SD, .SDcols = c(num.bedrooms.name, num.bathrooms.name, 
    cozy.name, lat.name, long.name, rating.name)]), label = dat[get(training.row.name) == 
    TRUE, get(price.name)])
dmat.test <- xgb.DMatrix(data = data.matrix(dat[get(training.row.name) == 
    FALSE, .SD, .SDcols = c(num.bedrooms.name, num.bathrooms.name, 
    cozy.name, lat.name, long.name, rating.name)]), label = dat[get(training.row.name) == 
    FALSE, get(price.name)])

mod.xgb <- xgboost(data = dmat.train, params = list(eta = 0.1, 
    gamma = 1), nrounds = 40, verbose = 0)
pred.xgb <- predict(object = mod.xgb, newdata = dmat.test)
rmse.xgb <- my.rmse(predicted = pred.xgb, actual = dat[get(training.row.name) == 
    FALSE, get(price.name)])
print(rmse.xgb)
[1] 75.0919

Other Models

K Nearest Neighbors

K Nearest Neighbors on the AirBnB Prices

library(FNN)
mod.knn <- knn.reg(train = dat[get(training.row.name) == 
    TRUE, .SD, .SDcols = c(num.bedrooms.name, num.bathrooms.name, 
    cozy.name, lat.name, long.name, rating.name)], test = dat[get(training.row.name) == 
    FALSE, .SD, .SDcols = c(num.bedrooms.name, num.bathrooms.name, 
    cozy.name, lat.name, long.name, rating.name)], y = dat[get(training.row.name) == 
    TRUE, get(price.name)], k = 20)
rmse.knn <- my.rmse(predicted = mod.knn$pred, actual = dat[get(training.row.name) == 
    FALSE, get(price.name)])
print(rmse.knn)
[1] 76.89153

Because we are including latitude and longitude, the nearest neighbors are likely to be comparable listings located as close as possible to the ones we are predicting for.

Support Vector Machines

Support Vector Machines on AirBnB’s Prices

library(e1071)
mod.svm <- svm(formula = as.formula(formula.price.lat.long), 
    data = dat[get(training.row.name) == TRUE, ])
pred.svm <- predict(object = mod.svm, newdata = dat[get(training.row.name) == 
    FALSE, .SD, .SDcols = c(num.bedrooms.name, num.bathrooms.name, 
    cozy.name, lat.name, long.name, rating.name)])
rmse.svm <- my.rmse(predicted = pred.svm, actual = dat[get(training.row.name) == 
    FALSE, get(price.name)])
print(rmse.svm)
[1] 74.82019

Neural Networks

A Motivating Example: The XOR function

Linear Regression Struggles with XOR

dt <- data.table(x = c(0, 0, 1, 1), y = c(0, 1, 0, 1), xor = c(0, 
    1, 1, 0))
fit.model(dat = dt, the.initial.formula = "xor ~ x + y", 
    model.type = "linear")$summary
      Variable Estimate Std. Error t value Pr(>|t|) Coef.Lower.95
1: (Intercept)      0.5      0.866   0.577    0.667        -1.197
2:           x      0.0      1.000   0.000    1.000        -1.960
3:           y      0.0      1.000   0.000    1.000        -1.960
   Coef.Upper.95
1:         2.197
2:         1.960
3:         1.960

Fitting a Neural Network to AirBnB’s Prices

library(nnet)
mod.nnet <- nnet(formula = as.formula(formula.price.lat.long), 
    data = dat[get(training.row.name) == TRUE, ], size = 5)
# weights:  41
initial  value 654983004.544586 
final  value 652297576.000000 
converged
pred.nnet <- predict(object = mod.nnet, newdata = dat[get(training.row.name) == 
    FALSE, ])
rmse.nnet <- my.rmse(predicted = pred.nnet, actual = dat[get(training.row.name) == 
    FALSE, get(price.name)])
print(rmse.nnet)
[1] 170.1958

Interestingly, the neural network performs poorly here. Part of the issue is practical: for a numeric outcome, nnet generally needs linout = TRUE and scaled inputs to fit well. More broadly, small networks were traditionally hard to train; it is only with the sharply increased scale and complexity of modern deep learning that neural networks became reliably competitive.

Ensemble Methods

An Ensemble Model of AirBnB’s Prices

pred.ensemble <- rowMeans(cbind(pred.rf, pred.xgb, pred.svm))
rmse.ensemble <- my.rmse(predicted = pred.ensemble, actual = dat[get(training.row.name) == 
    FALSE, get(price.name)])
rmse.ensemble
[1] 72.66922

Getting Better Models

A Summary of the Methods

This is just my opinion!

Practical Considerations

When you pursue machine learning as part of your work with an organization, it always requires a balance among:

Machine Learning Competitions

For this reason, many competitions make a point of hiding the testing set that is used for evaluation.

A More Practical Machine Learning Competition

Your Midterm Project